{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TDC @ DGL GNN User Group Demo\n",
    "\n",
    "Therapeutics Data Commons (TDC) is the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics. It provides support for the entire lifecycles of ML research for therapeutics. In this demo, we walk through key features of TDC. \n",
    "\n",
    "TDC Website: [tdcommons.ai](tdcommons.ai)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "TDC can be installed hassel-free via pip. The core package only requires minimal external dependencies. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install PyTDC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 1: TDC Data Loaders\n",
    "\n",
    "TDC provides 3-lines of codes data loaders for 66 datasets in 22 tasks. Here, we demonstrate how to use the data loaders.\n",
    "\n",
    "Detailed documentations about each task and dataset can be found via our website: [https://tdcommons.ai/overview/](https://tdcommons.ai/overview/)\n",
    "\n",
    "### Sample Dataset: Single-instance Problem\n",
    "\n",
    "To retrieve the hERG dataset in the Tox task under the single-instance prediction problem: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading...\n",
      "100%|██████████| 50.2k/50.2k [00:00<00:00, 288kiB/s]\n",
      "Loading...\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Drug_ID</th>\n",
       "      <th>Drug</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DEMETHYLASTEMIZOLE</td>\n",
       "      <td>Oc1ccc(CCN2CCC(Nc3nc4ccccc4n3Cc3ccc(F)cc3)CC2)cc1</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>GBR-12909</td>\n",
       "      <td>Fc1ccc(C(OCC[NH+]2CC[NH+](CCCc3ccccc3)CC2)c2cc...</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Drug_ID                                               Drug    Y\n",
       "0  DEMETHYLASTEMIZOLE  Oc1ccc(CCN2CCC(Nc3nc4ccccc4n3Cc3ccc(F)cc3)CC2)cc1  1.0\n",
       "1           GBR-12909  Fc1ccc(C(OCC[NH+]2CC[NH+](CCCc3ccccc3)CC2)c2cc...  1.0"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tdc.single_pred import Tox\n",
    "data = Tox(name = 'hERG')\n",
    "data.get_data().head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This `data` class contains various utility functions, such as generating data split using various spliting schemes. Here, we use scaffold split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 655/655 [00:00<00:00, 1455.23it/s]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Drug_ID</th>\n",
       "      <th>Drug</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DELAVIRDINE</td>\n",
       "      <td>CC(C)Nc1cccnc1N1CCN(C(=O)C2=CC3=C[C@H](NS(C)(=...</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SOPHOCARPINE</td>\n",
       "      <td>O=C1C=CC[C@@H]2[C@H]3CCCN4CCC[C@H](CN12)[C@H]34</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        Drug_ID                                               Drug    Y\n",
       "0   DELAVIRDINE  CC(C)Nc1cccnc1N1CCN(C(=O)C2=CC3=C[C@H](NS(C)(=...  0.0\n",
       "1  SOPHOCARPINE    O=C1C=CC[C@@H]2[C@H]3CCCN4CCC[C@H](CN12)[C@H]34  0.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "split = data.get_split(method = 'scaffold', \n",
    "                       seed = 42, \n",
    "                       frac = [0.7, 0.1, 0.2])\n",
    "split['train'].head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sample Dataset: Multi-instance Problem\n",
    "\n",
    "Similarly, suppose we want to retrieve the GDSC1 dataset for drug response prediction under the multi-instance problem, you can do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading...\n",
      "100%|██████████| 140M/140M [00:07<00:00, 19.5MiB/s] \n",
      "Loading...\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Drug_ID</th>\n",
       "      <th>Drug</th>\n",
       "      <th>Cell Line_ID</th>\n",
       "      <th>Cell Line</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Erlotinib</td>\n",
       "      <td>COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...</td>\n",
       "      <td>MC-CAR</td>\n",
       "      <td>[3.23827250519154, 2.98225419469807, 10.235490...</td>\n",
       "      <td>2.395685</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Erlotinib</td>\n",
       "      <td>COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...</td>\n",
       "      <td>ES3</td>\n",
       "      <td>[8.690197905033282, 3.0914731119366, 9.9924871...</td>\n",
       "      <td>3.140923</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Drug_ID                                               Drug Cell Line_ID  \\\n",
       "0  Erlotinib  COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...       MC-CAR   \n",
       "1  Erlotinib  COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...          ES3   \n",
       "\n",
       "                                           Cell Line         Y  \n",
       "0  [3.23827250519154, 2.98225419469807, 10.235490...  2.395685  \n",
       "1  [8.690197905033282, 3.0914731119366, 9.9924871...  3.140923  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tdc.multi_pred import DrugRes\n",
    "data = DrugRes(name = 'GDSC1')\n",
    "data.get_data().head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sample Dataset: Generation Problem\n",
    "Lastly, if we want to retrieve USPTO-50K dataset for retrosynthesis prediction under the generation problem, we can do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading...\n",
      "100%|██████████| 5.22M/5.22M [00:00<00:00, 5.49MiB/s]\n",
      "Loading...\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>input</th>\n",
       "      <th>output</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>COC(=O)CCC(=O)c1ccc(OC2CCCCO2)cc1O</td>\n",
       "      <td>C1=COCCC1.COC(=O)CCC(=O)c1ccc(O)cc1O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>COC(=O)c1cccc(-c2nc3cccnc3[nH]2)c1</td>\n",
       "      <td>COC(=O)c1cccc(C(=O)O)c1.Nc1cccnc1N</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                input                                output\n",
       "0  COC(=O)CCC(=O)c1ccc(OC2CCCCO2)cc1O  C1=COCCC1.COC(=O)CCC(=O)c1ccc(O)cc1O\n",
       "1  COC(=O)c1cccc(-c2nc3cccnc3[nH]2)c1    COC(=O)c1cccc(C(=O)O)c1.Nc1cccnc1N"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tdc.generation import RetroSyn\n",
    "data = RetroSyn(name = 'USPTO-50K')\n",
    "data.get_data().head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that all data are ML-ready!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 2: TDC Data Functions\n",
    "\n",
    "TDC provides numerous data functions to support ML for therapeutics research! Checkout here for more info: [https://tdcommons.ai/fct_overview/](https://tdcommons.ai/fct_overview/).\n",
    "\n",
    "We have lots of evaluators for model performance evaluation and also various realistic splits. In addition, many data processing helpers are provided. For example, for molecule data, we provide the most accessible SMILES string, where you can transform it to a DGL object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using backend: pytorch\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[DGLGraph(num_nodes=28, num_edges=58,\n",
       "          ndata_schemes={}\n",
       "          edata_schemes={}),\n",
       " DGLGraph(num_nodes=31, num_edges=68,\n",
       "          ndata_schemes={}\n",
       "          edata_schemes={})]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tdc.chem_utils import MolConvert\n",
    "converter = MolConvert(src = 'SMILES', dst = 'DGL')\n",
    "converter(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)\\C(=O)OC',\n",
    "       'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also call it when using data loader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Found local copy...\n",
      "Loading...\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Drug_ID</th>\n",
       "      <th>Drug</th>\n",
       "      <th>Drug_DGL</th>\n",
       "      <th>Y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DEMETHYLASTEMIZOLE</td>\n",
       "      <td>Oc1ccc(CCN2CCC(Nc3nc4ccccc4n3Cc3ccc(F)cc3)CC2)cc1</td>\n",
       "      <td>DGLGraph(num_nodes=33, num_edges=74,\\n        ...</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>GBR-12909</td>\n",
       "      <td>Fc1ccc(C(OCC[NH+]2CC[NH+](CCCc3ccccc3)CC2)c2cc...</td>\n",
       "      <td>DGLGraph(num_nodes=33, num_edges=72,\\n        ...</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Drug_ID                                               Drug  \\\n",
       "0  DEMETHYLASTEMIZOLE  Oc1ccc(CCN2CCC(Nc3nc4ccccc4n3Cc3ccc(F)cc3)CC2)cc1   \n",
       "1           GBR-12909  Fc1ccc(C(OCC[NH+]2CC[NH+](CCCc3ccccc3)CC2)c2cc...   \n",
       "\n",
       "                                            Drug_DGL    Y  \n",
       "0  DGLGraph(num_nodes=33, num_edges=74,\\n        ...  1.0  \n",
       "1  DGLGraph(num_nodes=33, num_edges=72,\\n        ...  1.0  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tdc.single_pred import Tox\n",
    "data = Tox(name = 'hERG', convert_format = 'DGL')\n",
    "data.get_data().head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another example is for many multi-instance prediction problem, we can formulate the interaction prediction between two biomedical entities as a link prediction in a biomedical entity graph. We also provide helper funcitons to construct this graph from the raw data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading...\n",
      "100%|██████████| 26.3M/26.3M [00:01<00:00, 13.4MiB/s]\n",
      "Loading...\n",
      "Done!\n",
      "The dataset label consists of affinity scores. Binarization using threshold 30 is conducted to construct the positive edges in the network. Adjust the threshold by to_graph(threshold = X)\n"
     ]
    }
   ],
   "source": [
    "from tdc.multi_pred import DTI\n",
    "data = DTI(name = 'DAVIS')\n",
    "output = data.to_graph(threshold = 30, \n",
    "                       format = 'dgl', \n",
    "                       split = True, \n",
    "                       frac = [0.7, 0.1, 0.2], \n",
    "                       seed = 42, \n",
    "                       order = 'descending')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DGLGraph(num_nodes=379, num_edges=1474,\n",
       "         ndata_schemes={}\n",
       "         edata_schemes={})"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output['dgl_graph']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another big category of TDC is molecule generation. Molecule generation aims to generate new molecule that has ideal properties, which are measured by oracle functions. We provide 17 oracle functions. This can be formulated as a graph generation problems. \n",
    "\n",
    "You can access each oracle through 3 lines of code. For example, to access the Synthetic accessibility oracle:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading Oracle...\n",
      "100%|██████████| 9.05M/9.05M [00:01<00:00, 5.23MiB/s]\n",
      "Done!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[2.706977149048555, 2.8548373344538067, 2.659973244931228]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from tdc import Oracle\n",
    "oracle = Oracle(name = 'SA')\n",
    "oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \\\n",
    "        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \\\n",
    "        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 3: TDC Leaderboard\n",
    "\n",
    "TDC leaderboard provides a place for fair model comparison for important therapeutics ML tasks. We devise the benchmarks to reflect the realistic drug discovery challenges.\n",
    "\n",
    "Here we demonstrate on using GNN to predict ADMET Property and submitting to TDC ADMET Caco2_Wang Benchmark."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Load the benchmark dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading Benchmark Group...\n",
      "100%|██████████| 1.47M/1.47M [00:00<00:00, 2.43MiB/s]\n",
      "Extracting zip file...\n",
      "Done!\n"
     ]
    }
   ],
   "source": [
    "from tdc import BenchmarkGroup\n",
    "group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')\n",
    "benchmark = group.get('Caco2_Wang')\n",
    "\n",
    "train_val, test = benchmark['train_val'], benchmark['test']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Train Your Models With Five Runs\n",
    "\n",
    "We use [DeepPurpose](https://github.com/kexinhuang12345/DeepPurpose), our latest framework for encoding compounds and proteins, as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "generating training, validation splits...\n",
      "100%|██████████| 728/728 [00:00<00:00, 1169.60it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Drug Property Prediction Mode...\n",
      "in total: 637 drugs\n",
      "encoding drug...\n",
      "unique drugs: 634\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 91 drugs\n",
      "encoding drug...\n",
      "unique drugs: 91\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 182 drugs\n",
      "encoding drug...\n",
      "unique drugs: 181\n",
      "do not do train/test split on the data for already splitted data\n",
      "predicting...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "generating training, validation splits...\n",
      "100%|██████████| 728/728 [00:00<00:00, 1295.03it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Drug Property Prediction Mode...\n",
      "in total: 637 drugs\n",
      "encoding drug...\n",
      "unique drugs: 635\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 91 drugs\n",
      "encoding drug...\n",
      "unique drugs: 90\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 182 drugs\n",
      "encoding drug...\n",
      "unique drugs: 181\n",
      "do not do train/test split on the data for already splitted data\n",
      "predicting...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "generating training, validation splits...\n",
      "100%|██████████| 728/728 [00:00<00:00, 1194.70it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Drug Property Prediction Mode...\n",
      "in total: 637 drugs\n",
      "encoding drug...\n",
      "unique drugs: 634\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 91 drugs\n",
      "encoding drug...\n",
      "unique drugs: 91\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 182 drugs\n",
      "encoding drug...\n",
      "unique drugs: 181\n",
      "do not do train/test split on the data for already splitted data\n",
      "predicting...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "generating training, validation splits...\n",
      "100%|██████████| 728/728 [00:00<00:00, 1260.67it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Drug Property Prediction Mode...\n",
      "in total: 637 drugs\n",
      "encoding drug...\n",
      "unique drugs: 634\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 91 drugs\n",
      "encoding drug...\n",
      "unique drugs: 91\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 182 drugs\n",
      "encoding drug...\n",
      "unique drugs: 181\n",
      "do not do train/test split on the data for already splitted data\n",
      "predicting...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "generating training, validation splits...\n",
      "100%|██████████| 728/728 [00:00<00:00, 1228.35it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Drug Property Prediction Mode...\n",
      "in total: 637 drugs\n",
      "encoding drug...\n",
      "unique drugs: 635\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 91 drugs\n",
      "encoding drug...\n",
      "unique drugs: 90\n",
      "do not do train/test split on the data for already splitted data\n",
      "Drug Property Prediction Mode...\n",
      "in total: 182 drugs\n",
      "encoding drug...\n",
      "unique drugs: 181\n",
      "do not do train/test split on the data for already splitted data\n",
      "predicting...\n"
     ]
    }
   ],
   "source": [
    "from DeepPurpose import CompoundPred as models\n",
    "from DeepPurpose.utils import data_process, generate_config\n",
    "\n",
    "drug_encoding = 'MPNN'\n",
    "prediction_runs = []\n",
    "\n",
    "for seed in [1, 2, 3, 4, 5]:\n",
    "    ### Generate Different Train, Valid Splits Given Seed ###\n",
    "    train, valid = group.get_train_valid_split(benchmark = benchmark['name'], split_type = 'default', seed = seed)\n",
    "    \n",
    "    ### Train the Model on Train, Valid Set ###\n",
    "    train = data_process(X_drug = train.Drug.values, y = train.Y.values, drug_encoding = drug_encoding, split_method='no_split')\n",
    "    val = data_process(X_drug = valid.Drug.values, y = valid.Y.values, drug_encoding = drug_encoding, split_method='no_split')\n",
    "    test = data_process(X_drug = benchmark['test'].Drug.values, y = benchmark['test'].Y.values, drug_encoding = drug_encoding, split_method='no_split')\n",
    "\n",
    "    config = generate_config(drug_encoding = drug_encoding, train_epoch = 3, LR = 0.001, batch_size = 128)\n",
    "    model = models.model_initialize(**config)\n",
    "    model.train(train, val, test, verbose = False)\n",
    "    \n",
    "    ### Generate Predictions on the Test Set ###\n",
    "    y_pred = model.predict(test)\n",
    "    prediction_runs.append({benchmark['name']: y_pred})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Evaluate the testing set prediction with pre-specified TDC evaluator\n",
    "\n",
    "The mean and standard deviation of the model performances are generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'caco2_wang': [1.221, 0.524]}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group.evaluate_many(prediction_runs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Copy the above results and submit to [THIS FORM](https://forms.gle/HYupGaV7WDuutbr9A).\n",
    "\n",
    "### That's it! Your results will be reflected on the [leaderboard website](https://tdcommons.ai/benchmark/admet_group/01caco2/) soon!\n",
    "\n",
    "We can see that model performance is far from perfect! There are many opportunities for algorithmic innovation! Submit your state-of-the-art models to TDC leaderboard!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Visit our website: [https://tdcommons.ai/](https://tdcommons.ai/)\n",
    "\n",
    "* Star our Github repo: [mims-harvard/TDC](https://github.com/mims-harvard/TDC)\n",
    "\n",
    "* Join our Slack Workspace: [https://rb.gy/bv0ffd](https://rb.gy/bv0ffd) \n",
    "\n",
    "We are looking for contributors!\n",
    "\n",
    "You can find this notebook at [https://github.com/mims-harvard/TDC/blob/master/tutorials/DGL_User_Group_Demo.ipynb](https://github.com/mims-harvard/TDC/blob/master/tutorials/DGL_User_Group_Demo.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:DeepPurpose]",
   "language": "python",
   "name": "conda-env-DeepPurpose-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
